The original goal of this project was to explain the striking similarities
between models produced by the Lasso and Forward Stagewise algorithms,
as exemplified by Figure 1. LARS, the Least Angle Regression algorithm,
provided the explanation and proved attractive in its own right, its simple
structure permitting theoretical insight into all three methods. In what
follows "LAR" will refer to the basic, unmodified form of Least Angle
Regression developed in Section 2, while "LARS" is the more general version
giving LAR, Lasso, Forward Stagewise and other variants as in Section 3.4.

Here is a summary of the principal properties developed in the paper: 






1. LAR builds a regression model in piecewise linear forward steps, accruing
explanatory variables one at a time; each step is taken along the equiangular
direction between the current set of explanators. The step size is
less greedy than classical forward stepwise regression, smoothly blending
in new variables rather than adding them discontinuously.

2. Simple modifications of the LAR procedure produce all Lasso and Forward
Stagewise solutions, allowing their efficient computation and showing
that these methods also follow piecewise linear equiangular paths.
The Forward Stagewise connection suggests that LARS-type methods
may also be useful in more general "boosting" applications.

3. The LARS algorithm is computationally efficient; calculating the full set
of LARS models requires the same order of computation as ordinary least
squares.

4. A k-step LAR fit uses approximately k degrees of freedom, in the sense
of added prediction error (4.5). This approximation is exact in the case of
orthogonal predictors and is generally quite accurate. It permits Cp-type
stopping rules that do not require auxiliary bootstrap or cross-validation
computations.

5. For orthogonal designs, LARS models amount to a succession of soft
thresholding estimates, (4.17).
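
Property 5 can be illustrated with a minimal sketch. It assumes an orthonormal design, so that the least-squares coefficients are simply the inner products of each predictor with the response, and it assumes the usual soft-thresholding form sign(c)(|c| - t)_+. The function name `soft_threshold` and the threshold `t` are illustrative and do not come from the paper.

```python
import numpy as np

def soft_threshold(c, t):
    """Shrink each entry of c toward zero by t, setting small entries to zero.

    c : least-squares coefficients under an (assumed) orthonormal design
    t : nonnegative threshold; lowering t traces out a sequence of fits
    """
    c = np.asarray(c, dtype=float)
    return np.sign(c) * np.maximum(np.abs(c) - t, 0.0)
```

Lowering `t` from the largest |c| down to zero brings the coefficients in one at a time, in decreasing order of |c|, which is the sense in which the fits form a succession.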

All of this is rather technical in nature, showing how one might efficiently
carry out a program of automatic model-building ("machine learning"). Such
programs seem increasingly necessary in a scientific world awash in huge data 
sets having hundreds or even thousands of available explanatory variables. 

What this paper, strikingly, does not do is justify any of the three algo- 
rithms as providing good estimators in some decision-theoretic sense. A few 
hints appear, as in the simulation study of Section 3.3, but mainly we are 
relying on recent literature to say that LARS methods are at least reason- 
able algorithms and that it is worthwhile understanding their properties. 
Model selection, the great underdeveloped region of classical statistics, de- 
serves careful theoretical examination but that does not happen here. We 
are not as pessimistic as Sandy Weisberg about the potential of automatic 
model selection, but agree that it requires critical examination as well as 
(over)enthusiastic algorithm building.

The LARS algorithm in any of its forms produces a one-dimensional path 
of prediction vectors going from the origin to the full least-squares solution. 
(Figures 1 and 3 display the paths for the diabetes data.) In the LAR case 
we can label the predictors $\hat{\mu}(k)$, where k is identified with both the number
of steps and the degrees of freedom. What the figures do not show is when
to stop the model-building process and report $\hat{\mu}$ back to the investigator.
The examples in our paper rather casually used stopping rules based on 
minimization of the Cp error prediction formula. 

Robert Stine and Hemant Ishwaran raise some reasonable doubts about 
Cp minimization as an effective stopping rule. For any one value of k, Cp is an
unbiased estimator of prediction error, so in a crude sense Cp minimization
is trying to be an unbiased estimator of the optimal stopping point $k_{\mathrm{opt}}$.
As such it is bound to overestimate $k_{\mathrm{opt}}$ in a large percentage of the cases,
perhaps near 100% if $k_{\mathrm{opt}}$ is near zero.

We can try to improve Cp by increasing the df multiplier "2" in (4.5). 
Suppose we change 2 to some value mult. In standard normal-theory model 
building situations, for instance choosing between linear, quadratic, cubic, 
. . . regression models, the mult rule will prefer model k + 1 to model k if the
relevant t-statistic exceeds $\sqrt{\mathit{mult}}$ in absolute value (here we are assuming
$\sigma^2$ known); mult = 2 amounts to using a rejection rule with $\alpha$ = 16%. Stine's
interesting Sp method chooses mult closer to 4, $\alpha$ = 5%.
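
The correspondence between the multiplier and the rejection level can be checked numerically. The sketch below assumes the t-statistic is standard normal (variance known, as stated above), so the level is the two-sided tail probability beyond the square root of mult. The function name `alpha_for_mult` is illustrative only.

```python
import math

def alpha_for_mult(mult):
    """Two-sided standard normal tail probability P(|Z| > sqrt(mult)).

    Uses the identity P(|Z| > t) = erfc(t / sqrt(2)).
    """
    return math.erfc(math.sqrt(mult) / math.sqrt(2.0))

if __name__ == "__main__":
    for mult in (2.0, 4.0):
        print(f"mult = {mult:g}: alpha = {alpha_for_mult(mult):.3f}")
```

Under this assumption mult = 2 gives a level of roughly 0.157 and mult = 4 roughly 0.046, matching the rounded 16% and 5% quoted in the text.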

This works fine for Stine's examples, where $k_{\mathrm{opt}}$ is indeed close to zero.
We tried it on the simulation example of Section 3.3. Increasing mult from
2 to 4 decreased the average selected step size from 31 to 15.5, but with a 
small increase in actual squared estimation error. Perhaps this can be taken 
as support for Ishwaran's point that since LARS estimates have a broad 
plateau of good behavior, one can often get by with much smaller models 
than suggested by Cp minimization. Of course no one example is conclusive 
in an area as multifaceted as model selection, and perhaps no 50 examples 
either. A more powerful theory of model selection is sorely needed, but until
it comes along we will have to make do with simulations, examples and bits 
and pieces of theory of the type presented here. 

Bayesian analysis of prediction problems tends to favor much bigger choices 
of mult. In particular the Bayesian information criterion (BIC) uses mult =
log(sample size). This choice has favorable consistency properties, selecting
the correct model with probability 1 as the sample size goes to infinity.
However, it can easily select too-small models in nonasymptotic situations. 
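
The stopping rules discussed above differ only in the multiplier, which a small sketch makes concrete. It assumes the criterion has the familiar form RSS_k / sigma^2 - n + mult * k, with k standing in for the degrees of freedom of the k-step fit; the exact expression (4.5) is not reproduced in this section, so this form is an assumption. The names `select_step` and `bic_mult` are illustrative.

```python
import math

def select_step(rss, sigma2, n, mult=2.0):
    """Return the step k minimizing rss[k] / sigma2 - n + mult * k.

    rss    : residual sums of squares for steps k = 0, 1, 2, ...
    sigma2 : error variance, assumed known
    n      : sample size
    mult   : degrees-of-freedom multiplier (2 gives the usual Cp)
    """
    scores = [r / sigma2 - n + mult * k for k, r in enumerate(rss)]
    return min(range(len(scores)), key=scores.__getitem__)

def bic_mult(n):
    """BIC-style multiplier, log(sample size)."""
    return math.log(n)
```

Because the penalty grows linearly in k, a larger multiplier can only move the selected step toward smaller models, which is the behavior described for mult = 4 and for the BIC choice.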

Jean-Michel Loubes and Pascal Massart provide two interpretations using 
penalized estimation criteria in the orthogonal regression setting. The first 
uses the link between soft thresholding and $\ell_1$ penalties to motivate entropy
methods for asymptotic analysis. The second is a striking perspective on 
the use of Cp with LARS. Their analysis suggests that our usual intuition 
about Cp, derived from selecting among projection estimates of different 
ranks, may be misleading in studying a nonlinear method like LARS that 
combines thresholding and shrinkage. They rewrite the LARS-Cp expres- 
sion (4.5) in terms of a penalized criterion for selecting among orthogonal 
projections. Viewed in this unusual way (for the estimator to be used is 
not a projection!), they argue that mult in fact behaves like log(n/k) rather
than 2 (in the case of a k-dimensional projection). It is indeed remarkable
that this same model-dependent value of mult, which has emerged in
several recent studies [Foster and Stine (1997), George and Foster (2000),
Abramovich, Benjamini, Donoho and Johnstone (2000) and Birgé and Massart
(2001)], should also appear as relevant for the analysis of LARS. We look
forward to the further extension of the Birgé-Massart approach to handling
these nondeterministic penalties. 

Cross-validation is a nearly unbiased estimator of prediction error and 
as such will perform similarly to Cp (with mult = 2). The differences between
the two methods concern generality, efficiency and computational
ease. Cross-validation, and nonparametric bootstrap methods such as the 
.632+ rule, can be applied to almost any prediction problem. Cp is more
specialized, but when it does apply it gives more efficient estimates of pre- 
diction error [Efron (2004)] at almost no computational cost. It applies here 
to LAR, at least when m < n, as in David Madigan and Greg Ridgeway's
example. 

We agree with Madigan and Ridgeway that our new LARS algorithm may 
provide a boost for the Lasso, making it more useful and attractive for data 
analysts. Their suggested extension of LARS to generalized linear models is 
interesting. In logistic regression, the $L_1$-constrained solution is not piecewise
linear and hence the pathwise optimization is more difficult. Madigan
and Ridgeway also compare LAR and Lasso to least squares boosting for 
prediction accuracy on three real examples, with no one method prevailing. 

Saharon Rosset and Ji Zhu characterize a class of problems for which
the coefficient paths, like those in this paper, are piecewise linear. This is a
useful advance, as demonstrated with their robust version of the Lasso, and
the $\ell_1$-regularized Support Vector Machine. The former addresses some of
the robustness concerns of Weisberg. They also report on their work that
strengthens the connections between $\varepsilon$-boosting and $\ell_1$-regularized function
fitting.

Berwin Turlach's example with uniform predictors surprised us as well. 
It turns out that 10-fold cross-validation selects the model with $|\beta_1| \approx 45$ in
his Figure 3 (left panel), and by then the correct variables are active and the
interactions have died down. However, the same problem with 10 times the
noise variance does not recover in a similar way. For this example, if the $X_j$
are uniform on $[-\frac{1}{2}, \frac{1}{2}]$ rather than [0, 1], the problem goes away, strongly
suggesting that proper centering of predictors (in this case the interactions, 
since the original variables are automatically centered by the algorithm) is 
important for LARS. 

Turlach also suggests an interesting proposal for enforcing marginality, 
the hierarchical relationship between the main effects and interactions. In 
his notation, marginality says that $\beta_{i:j}$ can be nonzero only if $\beta_i$ and $\beta_j$
are nonzero. An alternative approach, more in the "continuous spirit" of the 
Lasso, would be to include constraints 

\[ |\beta_{i:j}| \le \min\{|\beta_i|, |\beta_j|\}. \]

This implies marginality but is stronger. These constraints are linear and, 
according to Rosset and Zhu above, a LARS-type algorithm should be avail- 
able for its estimation. Leblanc and Tibshirani (1998) used constraints like 
these for shrinking classification and regression trees. 

As Turlach suggests, there are various ways to restate the LAR algorithm, 
including the following nonalgebraic purely statistical statement in terms of 
repeated fitting of the residual vector r: 

1. Start with $r = y$ and $\beta_j = 0\ \forall j$.

2. Find the predictor $x_j$ most correlated with r.

3. Increase $\beta_j$ in the direction of the sign of corr(r, $x_j$) until some other
competitor $x_k$ has as much correlation with the current residual as does
$x_j$.

4. Update r, and move $(\beta_j, \beta_k)$ in the joint least squares direction for the
regression of r on $(x_j, x_k)$ until some other competitor $x_\ell$ has as much
correlation with the current residual.

5. Continue in this way until all predictors have been entered. Stop when
corr(r, $x_j$) = 0 $\forall j$, that is, the OLS solution.

Traditional forward stagewise would have completed the least-squares step
at each stage; here it would go only a fraction of the way, until the next 
competitor joins in. 
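
The residual-fitting statement of LAR can be sketched in a few lines of NumPy. This is a minimal illustration rather than the implementation used in the paper: it assumes the columns of X are centered and scaled to unit length, that y is centered, that there are more observations than predictors, and that no ties occur. The function name `lar_path` and the tolerance `tol` are hypothetical. Each pass moves the active coefficients a fraction `gamma` of the way along the joint least-squares direction, stopping as soon as an inactive predictor matches the active ones in absolute correlation with the residual.

```python
import numpy as np

def lar_path(X, y, tol=1e-12):
    """Basic LAR (no Lasso or Stagewise modification).

    Returns an array whose rows are the coefficient vectors at each knot,
    starting at zero and ending at the least-squares solution.
    """
    n, p = X.shape
    beta = np.zeros(p)
    r = np.asarray(y, dtype=float).copy()
    active = []
    path = [beta.copy()]
    for _ in range(p):
        c = X.T @ r                       # inner products with the residual
        if not active:                    # start with the most correlated one
            active.append(int(np.argmax(np.abs(c))))
        C = np.max(np.abs(c[active]))     # common absolute value on the active set
        if C < tol:
            break
        XA = X[:, active]
        d, *_ = np.linalg.lstsq(XA, r, rcond=None)  # joint least-squares direction
        a = X.T @ (XA @ d)                # rate at which each inner product changes
        gamma, entering = 1.0, None       # gamma = 1 is the full least-squares step
        for k in range(p):
            if k in active:
                continue
            # Smallest positive fraction at which predictor k catches up.
            for num, den in ((C - c[k], C - a[k]), (C + c[k], C + a[k])):
                if den > tol:
                    g = num / den
                    if tol < g < gamma:
                        gamma, entering = g, k
        beta[active] += gamma * d
        r = r - gamma * (XA @ d)
        path.append(beta.copy())
        if entering is None:              # all predictors entered: OLS reached
            break
        active.append(entering)
    return np.array(path)
```

Setting `gamma` to 1 at every stage would instead complete the least-squares step each time, which is the contrast drawn in the paragraph above.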

Keith Knight asks whether Forward Stagewise and LAR have implicit 
criteria that they are optimizing. In unpublished work with Trevor Hastie, 
Jonathan Taylor and Guenther Walther, we have made progress on that 
question. It can be shown that the Forward Stagewise procedure does a 
sequential minimization of the residual sum of squares, subject to 

\[ \sum_j \int_0^t |\beta_j'(s)|\, ds \le t. \]

This quantity is the total $L_1$ arc-length of the coefficient curve $\beta(t)$. If each
component $\beta_j(t)$ is monotone nondecreasing or nonincreasing, then $L_1$ arc-length
equals the $L_1$-norm $\sum_j |\beta_j|$. Otherwise, they are different and $L_1$ arc-length
discourages sign changes in the derivative. That is why the Forward
Stagewise solutions tend to have long flat plateaus. We are less sure of the 
criterion for LAR, but currently believe that it uses a constraint of the form 

\[ \sum_j \left| \int_0^t \beta_j(s)\, ds \right| \le A. \]
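
The difference between $L_1$ arc-length and the $L_1$-norm is easy to see on a piecewise linear path. The sketch below assumes the coefficient curve starts at zero and is given by its values at the knots; within each linear segment the derivative has constant sign, so the arc-length is the sum of absolute coordinate changes between consecutive knots. The function names are illustrative.

```python
import numpy as np

def l1_arc_length(path):
    """L1 arc-length of a piecewise linear coefficient curve.

    path : array of shape (knots, p), coefficient vectors at the knots
    """
    path = np.asarray(path, dtype=float)
    return float(np.abs(np.diff(path, axis=0)).sum())

def l1_norm(beta):
    """Sum of absolute values of a coefficient vector."""
    return float(np.abs(np.asarray(beta, dtype=float)).sum())

if __name__ == "__main__":
    monotone = [[0, 0], [1, 0], [2, -1]]     # each coordinate moves one way only
    reversing = [[0, 0], [2, 0], [1, 1]]     # first coordinate turns back
    print(l1_arc_length(monotone), l1_norm(monotone[-1]))
    print(l1_arc_length(reversing), l1_norm(reversing[-1]))
```

For the monotone path the two quantities agree; for the path whose first coordinate reverses, the arc-length is strictly larger than the norm of the endpoint, so a bound on arc-length penalizes the reversal.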

Sandy Weisberg, as a ranking expert on the careful analysis of regres- 
sion problems, has legitimate grounds for distrusting automatic methods. 
Only foolhardy statisticians dare to ignore a problem's context. (For in- 
stance it helps to know that diabetes progression behaves differently after 
menopause, implying strong age-sex interactions.) Nevertheless even for a 
"small" problem like the diabetes investigation there is a limit to how much 
context the investigator can provide. After that one is drawn to the use of 
automatic methods, even if the "automatic" part is not encapsulated in a 
single computer package. 

In actual practice, or at least in good actual practice, there is a cycle 
of activity between the investigator, the statistician and the computer. For 
a multivariable prediction problem like the diabetes example, LARS-type 
programs are a good first step toward a solution, but hopefully not the last 
step. The statistician examines the output critically, as did several of our 
commentators, discussing the results with the investigator, who may at this 
point suggest adding or removing explanatory variables, and so on, and so 
on. 

Fully automatic regression algorithms have one notable advantage: they 
permit an honest evaluation of estimation error. For instance the Cp-selected 
LAR quadratic model estimates that a patient one standard deviation above 
average on BMI has an increased response expectation of 23.8 points. The 
bootstrap analysis (3.16) provided a standard error of 3.48 for this estimate. 
Bootstrapping, jackknifing and cross-validation require us to repeat the orig- 
inal estimation procedure for different data sets, which is easier to do if you 
know what the original procedure actually was. 

Our thanks go to the discussants for their thoughtful remarks, and to the
Editors for the formidable job of organizing this discussion. 